Search CORE

182 research outputs found

Lecture 09: Hierarchically Low Rank and Kronecker Methods

Author: Yokota Rio
Publication venue: ScholarWorks@UARK
Publication date: 09/04/2021
Field of study

Exploiting structures of matrices goes beyond identifying their non-zero patterns. In many cases, dense full-rank matrices have low-rank submatrices that can be exploited to construct fast approximate algorithms. In other cases, dense matrices can be decomposed into Kronecker factors that are much smaller than the original matrix. Sparsity is a consequence of the connectivity of the underlying geometry (mesh, graph, interaction list, etc.), whereas the rank-deficiency of submatrices is closely related to the distance within this underlying geometry. For high dimensional geometry encountered in data science applications, the curse of dimensionality poses a challenge for rank-structured approaches. On the other hand, models in data science that are formulated as a composition of functions, lead to a Kronecker product structure that yields a different kind of fast algorithm. In this lecture, we will look at some examples of when rank structure and Kronecker structure can be useful

ScholarWorks@UARK

Petascale turbulence simulation using a highly parallel fast multipole method on GPUs

Author: Barnes
Chatelain
Cheng
Cottet
Davidson
Dehnen
Gingold
Greengard
Hamada
Ishihara
Kenji Yasuoka
L.A. Barba
Lambert
Rahimian
Rio Yokota
Salmon
Sundar
Tetsu Narumi
Warren
Warren
Yokokawa
Yokota
Yokota
Yokota
Yokota
Yokota
Publication venue: 'Elsevier BV'
Publication date: 03/09/2012
Field of study

This paper reports large-scale direct numerical simulations of homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08 petaflop/s on gpu hardware using single precision. The simulations use a vortex particle method to solve the Navier-Stokes equations, with a highly parallel fast multipole method (FMM) as numerical engine, and match the current record in mesh size for this application, a cube of 4096^3 computational points solved with a spectral method. The standard numerical approach used in this field is the pseudo-spectral method, relying on the FFT algorithm as numerical engine. The particle-based simulations presented in this paper quantitatively match the kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted code. In terms of parallel performance, weak scaling results show the fmm-based vortex method achieving 74% parallel efficiency on 4096 processes (one gpu per mpi process, 3 gpus per node of the TSUBAME-2.0 system). The FFT-based spectral method is able to achieve just 14% parallel efficiency on the same number of mpi processes (using only cpu cores), due to the all-to-all communication pattern of the FFT algorithm. The calculation time for one time step was 108 seconds for the vortex method and 154 seconds for the spectral method, under these conditions. Computing with 69 billion particles, this work exceeds by an order of magnitude the largest vortex method calculations to date

arXiv.org e-Print Archive

Crossref

Mixed-Precision Random Projection for RandNLA on Tensor Cores

Author: Ootomo Hiroyuki
Yokota Rio
Publication venue
Publication date: 10/04/2023
Field of study

Random projection can reduce the dimension of data while capturing its structure and is a fundamental tool for machine learning, signal processing, and information retrieval, which deal with a large amount of data today. RandNLA (Randomized Numerical Linear Algebra) leverages random projection to reduce the computational complexity of low-rank decomposition of tensors and solve least-square problems. While the computation of the random projection is a simple matrix multiplication, its asymptotic computational complexity is typically larger than other operations in a RandNLA algorithm. Therefore, various studies propose methods for reducing its computational complexity. We propose a fast mixed-precision random projection method on NVIDIA GPUs using Tensor Cores for single-precision tensors. We exploit the fact that the random matrix requires less precision, and develop a highly optimized matrix multiplication between FP32 and FP16 matrices -- SHGEMM (Single and Half-precision GEMM) -- on Tensor Cores, where the random matrix is stored in FP16. Our method can compute Randomized SVD 1.28 times faster and Random projection high order SVD 1.75 times faster than baseline single-precision implementations while maintaining accuracy.Comment: PASC'2

arXiv.org e-Print Archive